EXPLORATION OF RED WINE by PRAVEEN KUMAR

The column names of all the parameters used in assessing the quality of the red wine are shown below.

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"

The summary below gives the minimum, 1st quartile, median, mean, 3rd quartile and maximum of every parameter of the red wine.

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000
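Listings like the two above typically come from `names()` and `summary()`. The sketch below shows those calls on a small synthetic data frame, because the report's data file is not named here; the values are random stand-ins and only three of the twelve columns are reproduced.

```r
# Synthetic stand-in for the red wine data (NOT the real measurements).
set.seed(1)
n <- 200
wine <- data.frame(
  alcohol          = runif(n, 8.4, 14.9),
  volatile.acidity = runif(n, 0.12, 1.58),
  quality          = sample(3:8, n, replace = TRUE)
)

names(wine)    # column names, as in the first listing
summary(wine)  # Min., 1st Qu., Median, Mean, 3rd Qu., Max. for each column
str(wine)      # number of observations and the type of each column
```

`str()` is a quick way to confirm the number of observations and that `quality` was read as an integer rather than as text.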

Univariate Plots Section

The plots below are univariate histograms of the individual parameters. We use them to see how each feature of the red wine varies. The first plots show the quality, the pH value and the alcohol value.

The histogram below shows the pH value of the red wine, that is, the number of wines at each pH value.

The histogram below shows the alcohol value of the red wine, that is, the number of wines at each alcohol percentage.

The histogram below shows the density of the red wine, that is, the number of wines at each density value.

The histogram below shows the sulphate content of the red wine, that is, the number of wines at each sulphate level.

The histogram below shows the free sulfur dioxide of the red wine, that is, the number of wines at each free sulfur dioxide level.

The histogram below shows the total sulfur dioxide of the red wine, that is, the number of wines at each total sulfur dioxide level.

The histogram below shows the fixed acidity of the red wine, that is, the number of wines at each fixed acidity level.

The histogram below shows the volatile acidity of the red wine, that is, the number of wines at each volatile acidity level.

The histogram below shows the citric acid of the red wine, that is, the number of wines at each citric acid level.

The histogram below shows the residual sugar of the red wine, that is, the number of wines at each residual sugar level.

The histogram below shows the chlorides of the red wine, that is, the number of wines at each chloride level.
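Histograms like the ones described in this section could be drawn along the following lines. This is a minimal sketch, assuming ggplot2 was used; the data frame is synthetic, and the bin widths, colours and axis labels are illustrative choices rather than the report's own settings.

```r
library(ggplot2)

# Synthetic stand-in for the red wine data (NOT the real measurements).
set.seed(2)
wine <- data.frame(
  quality = sample(3:8, 300, replace = TRUE),
  alcohol = runif(300, 8.4, 14.9)
)

# Integer-valued score: a bin width of 1 gives one bar per quality level.
p_quality <- ggplot(wine, aes(x = quality)) +
  geom_histogram(binwidth = 1, colour = "black", fill = "grey70") +
  scale_x_continuous(breaks = 3:8) +
  labs(x = "Quality (score)", y = "Count")

# Continuous measurement: pick a bin width that suits the range of the data.
p_alcohol <- ggplot(wine, aes(x = alcohol)) +
  geom_histogram(binwidth = 0.25, colour = "black", fill = "grey70") +
  labs(x = "Alcohol (% by volume)", y = "Count")

print(p_quality)
print(p_alcohol)
```

The same pattern applies to the other parameters by swapping the column in `aes()` and adjusting `binwidth`.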

Univariate Analysis

What is the structure of your dataset?

The dataset describes red wine. It has 1599 observations and 12 variables: 11 measured parameters plus the quality score that says how good the red wine is. Quality is scored on a scale from 0 to 10; in this dataset it ranges from 3 to 8. The alcohol percentage varies from 8.4 to 14.9. The other measurements are pH, sulphates, free.sulfur.dioxide, total.sulfur.dioxide, residual.sugar, fixed.acidity, volatile.acidity, citric.acid, chlorides and density.

What is/are the main feature(s) of interest in your dataset?

The main features of interest are the quality and the alcohol level of the red wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Other features that could help explain the quality of the red wine are volatile.acidity, fixed.acidity and residual.sugar.

Did you create any new variables from existing variables in the dataset?

No.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Bivariate Plots Section

In the bivariate section we first plot the correlation matrix. It shows which features are most strongly correlated, and we then plot those pairs in more detail. Here a strong correlation means a correlation coefficient with an absolute value greater than 0.25.

The above plot shows the correlations between all the parameters. Its purpose is to check which features correlate strongly with quality. The correlations of quality with alcohol, with volatile.acidity and with sulphates are the largest. The correlation of every other parameter with quality lies between -0.25 and 0.25.
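The screening step described here (keep the features whose correlation with quality exceeds 0.25 in absolute value) can be sketched with base R's `cor()`. The data below are synthetic: `quality` is deliberately constructed to rise with `alcohol` and fall with `volatile.acidity`, while `density` is unrelated noise, so the numbers do not reproduce the report's.

```r
# Synthetic stand-in for the red wine data (NOT the real measurements).
set.seed(3)
n <- 1000
alcohol          <- rnorm(n, 10.4, 1)
volatile.acidity <- rnorm(n, 0.53, 0.18)
density          <- rnorm(n, 0.9967, 0.002)
quality <- 5.6 + 0.4 * (alcohol - 10.4) -
  2.0 * (volatile.acidity - 0.53) + rnorm(n, 0, 0.7)
wine <- data.frame(alcohol, volatile.acidity, density, quality)

# Pearson correlation of every column with quality.
cors <- cor(wine)[, "quality"]
cors <- cors[names(cors) != "quality"]  # drop the trivial self-correlation

# Keep only the features above the 0.25 threshold used in this report.
threshold <- 0.25
strong <- cors[abs(cors) > threshold]
round(sort(strong, decreasing = TRUE), 2)
```

Filtering on `abs(cors)` matters because a strongly negative correlation, such as the one with volatile.acidity, is as informative as a positive one.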

We now check the correlation coefficient for the features that showed the largest correlation with quality. The plot below is a box plot of alcohol vs quality.

Below is the Pearson correlation test between alcohol and quality, which gives the value of the correlation coefficient.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$alcohol and wine$quality
## t = 21.6395, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663
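Output in this format is what `cor.test()` prints. The sketch below runs the same kind of test on synthetic data (so the numbers differ from the output above) and shows how the t statistic follows from the correlation r and the degrees of freedom, t = r * sqrt(df / (1 - r^2)).

```r
# Synthetic stand-in for the red wine data (NOT the real measurements).
set.seed(4)
n <- 400
alcohol <- rnorm(n, 10.4, 1)
quality <- 5.6 + 0.4 * (alcohol - 10.4) + rnorm(n, 0, 0.75)
wine <- data.frame(alcohol, quality)

ct <- cor.test(wine$alcohol, wine$quality, method = "pearson")
ct$estimate   # sample correlation r
ct$conf.int   # 95 percent confidence interval for r
ct$parameter  # degrees of freedom, n - 2

# Recompute the t statistic by hand from r and df.
r  <- unname(ct$estimate)
df <- unname(ct$parameter)
t_manual <- r * sqrt(df / (1 - r^2))
t_manual
```

Applying the same formula to the values reported above, r = 0.4761663 and df = 1597, gives t of about 21.64, which agrees with the printed t = 21.6395.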

The plot below is a scatter plot of volatile.acidity vs quality, drawn to see how volatile acidity varies with quality, since the two are strongly correlated. The plot also shows that wines with a higher alcohol percentage have lower volatile acidity. Although there is no direct relationship between alcohol and volatile acidity, a high alcohol level can stop fermentation because the yeast dies off.

Below is the correlation test between volatile acidity and the quality of the wine. The correlation is negative, which means that the lower the volatile acidity, the better the quality of the wine tends to be. The main driving factor is acetic acid, the principal volatile acid in red wine: a small quantity is acceptable, but too much gives the wine a vinegar taste.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$volatile.acidity and wine$quality
## t = -16.9542, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578

The plot below is a scatter plot of sulphates vs quality, drawn to see whether quality varies with sulphates, since we saw some correlation between them. Wines with higher sulphates tend to have slightly better quality, in line with the positive correlation reported below. Small amounts of sulfur dioxide are added to red wine as a preservative.

Below is the correlation test between sulphates and quality. There is a correlation between them, but at about 0.25 it is not strong.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$sulphates and wine$quality
## t = 10.3798, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

During the bivariate analysis we looked at scatter plots and Pearson correlations. Alcohol and volatile.acidity had a strong relationship with the quality of the wine. The fixed.acidity had only a very weak relationship with quality. The residual.sugar did not show any relationship with quality; most of its values were below 4, and the same was true of the chlorides. The pH value measures how acidic the wine is, so it behaves similarly to volatile.acidity. The sulphates did not show a strong relationship with quality, so they are not plotted further, and neither are free.sulfur.dioxide or total.sulfur.dioxide. The citric.acid also did not show any relationship with the quality of the wine.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The interesting relationship observed is between citric.acid and the pH value in the scatter plot. The more citric acid there is, the lower the pH value, which follows from the definition of the pH scale. The two are negatively correlated.

What was the strongest relationship you found?

The strongest relationships found in the plots are (1) between the alcohol and the quality of the wine and (2) between the volatile.acidity and the quality of the wine.

Multivariate Plots Section

After the bivariate and univariate plots we concluded that alcohol and volatile acidity are the features most strongly related to quality. In the multivariate section we use several types of plot to see how these features vary together with quality. The first multivariate plot is a line plot of volatile.acidity vs quality with alcohol shown as color; it shows how quality varies with volatile.acidity.

The plot below shows alcohol vs volatile.acidity for the different quality ranges. It shows that the acidity keeps decreasing at higher quality.

This is the scatter plot of quality against volatile.acidity, with color showing the alcohol percentage. At higher quality and lower acidity there are a few dark spots, which indicate a higher alcohol percentage.

The plot below is a scatter plot of alcohol vs volatile.acidity colored by quality. This plot does not show much information.

The plot below is a scatter plot of quality vs alcohol percentage. The color gets darker as the alcohol gets higher.
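The colored scatter plots described above map a third variable to color. A minimal sketch of both variants follows, assuming ggplot2; the data are synthetic and the color scale and labels are illustrative choices, not the report's.

```r
library(ggplot2)

# Synthetic stand-in for the red wine data (NOT the real measurements).
set.seed(8)
n <- 300
alcohol          <- runif(n, 8.4, 14.9)
volatile.acidity <- runif(n, 0.12, 1.2)
raw_quality <- 3 + 0.3 * alcohol - 1.2 * volatile.acidity + rnorm(n, 0, 0.6)
quality <- round(pmin(8, pmax(3, raw_quality)))  # integer scores from 3 to 8
wine <- data.frame(alcohol, volatile.acidity, quality)

# Continuous third variable (alcohol): map it to a color gradient.
p_by_alcohol <- ggplot(wine, aes(x = volatile.acidity, y = quality,
                                 colour = alcohol)) +
  geom_jitter(height = 0.2, width = 0, alpha = 0.7) +
  scale_colour_gradient(low = "lightblue", high = "darkblue") +
  labs(x = "Volatile acidity", y = "Quality", colour = "Alcohol (%)")

# Discrete third variable (quality): convert it to a factor so that each
# score gets its own distinct color instead of a gradient.
p_by_quality <- ggplot(wine, aes(x = alcohol, y = volatile.acidity,
                                 colour = factor(quality))) +
  geom_point(alpha = 0.7) +
  labs(x = "Alcohol (%)", y = "Volatile acidity", colour = "Quality")

print(p_by_alcohol)
print(p_by_quality)
```

Because quality takes only a few integer values, a small vertical jitter keeps points from stacking on top of each other in the first plot.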

The plot below is a box plot of volatile.acidity vs quality. It shows the lower and upper quartiles and the minimum and maximum values of volatile.acidity for each quality value.
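A box plot with one box per quality score can be sketched as follows, again assuming ggplot2 and using synthetic data in which volatile acidity is made to drift down as quality goes up.

```r
library(ggplot2)

# Synthetic stand-in for the red wine data (NOT the real measurements).
set.seed(5)
n <- 300
quality <- sample(3:8, n, replace = TRUE)
volatile.acidity <- pmax(0.12, 0.9 - 0.07 * quality + rnorm(n, 0, 0.12))
wine <- data.frame(quality, volatile.acidity)

# quality is converted to a factor so that each score gets its own box;
# a numeric x would generally collapse into a single box.
p_box <- ggplot(wine, aes(x = factor(quality), y = volatile.acidity)) +
  geom_boxplot() +
  labs(x = "Quality (score)", y = "Volatile acidity")
print(p_box)

# The numbers behind the boxes: min, quartiles and max per quality score.
five_num <- tapply(wine$volatile.acidity, wine$quality, quantile)
five_num
```

Note that `geom_boxplot()` draws points beyond 1.5 times the interquartile range as separate outliers, so the whisker ends are not always the true minimum and maximum returned by `quantile()`.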

Below is the scatter plot of quality vs volatile.acidity with a line added by the stat_smooth function. The line has a negative slope, which confirms that volatile.acidity and quality are negatively correlated.

Below is the scatter plot of quality vs alcohol with a line added by the stat_smooth function. It confirms that the relationship between alcohol and quality is roughly linear, and the correlation plot already showed that the correlation is positive.
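A scatter plot with a fitted line from `stat_smooth()` can be sketched like this. The `method = "lm"` setting is an assumption, since the report does not say which smoothing method was used, and the data are synthetic with a positive alcohol effect built in.

```r
library(ggplot2)

# Synthetic stand-in for the red wine data (NOT the real measurements).
set.seed(6)
n <- 300
alcohol <- runif(n, 8.4, 14.9)
raw_quality <- 1.9 + 0.36 * alcohol + rnorm(n, 0, 0.7)
quality <- round(pmin(8, pmax(3, raw_quality)))  # integer scores from 3 to 8
wine <- data.frame(alcohol, quality)

p_smooth <- ggplot(wine, aes(x = alcohol, y = quality)) +
  geom_jitter(height = 0.2, width = 0, alpha = 0.4) +      # spread integer scores
  stat_smooth(method = "lm", formula = y ~ x, se = TRUE) + # straight-line fit
  labs(x = "Alcohol (%)", y = "Quality (score)")
print(p_smooth)

# The slope of the same straight line, fitted directly with lm().
slope <- coef(lm(quality ~ alcohol, data = wine))[["alcohol"]]
slope
```

The sign of `slope` is what the plot communicates visually: positive for alcohol, and it would be negative if `volatile.acidity` were put on the x axis instead.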

Below is the box plot of alcohol vs quality.

Below is the box plot of volatile.acidity vs quality, colored by alcohol percentage. For the higher quality red wines the acidity is low, and a few blue box plots in the lower range of acidity have the higher alcohol percentage. From all these plots we can establish that good quality red wine correlates with higher alcohol and lower acidity levels.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The multivariate plots above show the relationships between volatile.acidity, alcohol and the quality of the red wine. The strong relationship between alcohol and quality appears in the scatter plot, whose regression line has a positive slope, and in the box plot, where quality increases as alcohol increases. The other strong relationship is between volatile.acidity and quality: the scatter plot has a regression line with a negative slope, indicating that quality decreases as volatile.acidity increases.

Were there any interesting or surprising interactions between features?

The relationship between alcohol and volatile.acidity was flat, meaning there was no clear correlation between them.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes. The details are shown below. The model uses two parameters, volatile.acidity and alcohol, with a 95% interval. I then used this model to predict the quality of a red wine with a volatile acidity of 0.1 and an alcohol of 14, which gave a predicted quality of 7.35 with a lower bound of 6.03 and an upper bound of 8.67.

##        fit      lwr      upr
## 1 7.350483 6.034318 8.666648
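A `fit`/`lwr`/`upr` row like the one above is the shape of output that `predict()` returns for a linear model. The sketch below assumes the model was an ordinary linear regression fitted with `lm()`, which the report does not confirm, and it uses synthetic data, so the numbers will not match. The report also does not say which kind of interval was requested, so both are shown.

```r
# Synthetic stand-in for the red wine data (NOT the real measurements).
set.seed(7)
n <- 500
alcohol          <- runif(n, 8.4, 14.9)
volatile.acidity <- runif(n, 0.12, 1.58)
quality <- 3 + 0.3 * alcohol - 1.2 * volatile.acidity + rnorm(n, 0, 0.65)
wine <- data.frame(alcohol, volatile.acidity, quality)

# Two-predictor linear model for quality.
fit <- lm(quality ~ volatile.acidity + alcohol, data = wine)

# The hypothetical wine from the text: volatile acidity 0.1, alcohol 14.
new_wine <- data.frame(volatile.acidity = 0.1, alcohol = 14)

# "prediction": range for the quality of one new wine (wider).
# "confidence": range for the mean quality at these inputs (narrower).
pred <- predict(fit, newdata = new_wine, interval = "prediction", level = 0.95)
conf <- predict(fit, newdata = new_wine, interval = "confidence", level = 0.95)
pred
conf
```

One limitation worth keeping in mind is that a volatile acidity of 0.1 lies just below the smallest value in the summary table (0.12), so the prediction is a slight extrapolation beyond the observed data.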

Final Plots and Summary

Plot One

The histogram below shows the number of red wines at each quality score.

Description One

The above plot is the histogram of the quality of the wine. All the other features are considered as measures of this quality. The histogram shows that most wines have a quality centered around 6.

Plot Two

Description Two

The above plot shows box plots of the features volatile.acidity and alcohol. These two features are the ones most strongly related to the quality of the red wine. In the box plot, higher quality goes with a higher alcohol percentage and a lower volatile acidity. Moving towards the left side, the alcohol percentage decreases and the volatile.acidity increases.

Plot Three

Description Three

The above plot is the scatter plot of alcohol vs volatile.acidity with quality as the color. It shows how volatile.acidity relates to alcohol at each quality level. The higher quality red wines have an alcohol percentage in the range of 12 to 14 and an acidity level below 0.8. This means that a good quality red wine tends to have a higher percentage of alcohol and a lower acidity level.


Reflection

The data given was the red wine dataset, with the different factors that affect the quality of the wine. First, histograms were plotted to check the variation of all the features. Quality is the outcome we are trying to explain from the features. Then the correlations were computed to see which features had a strong correlation with the output, quality. None of the features had a very strong correlation with quality. However, volatile.acidity, alcohol and sulphates had a reasonable correlation with it. The volatile.acidity had a negative correlation with quality, whereas alcohol had a positive correlation. The correlation of the sulphates was not strong enough when we checked the Pearson correlation. The box plots and the multivariate plots were made to substantiate the findings. Then a model was built from the two features volatile.acidity and alcohol, and we used this model to predict the quality of the red wine for a given input, which gave a satisfactory output.

Reflection - Future Work

One further question would be whether the storage time and place of the red wine influence its quality, assuming that there is a one-to-one mapping between the quality of the red wine and its price.

A list of Web sites, books, forums, blog posts, github repositories, etc. that you referred to or used in creating your submission (add N/A if you did not use any such resources).

docs.ggplot2.org

http://uregina.ca/~gingrich/regr.pdf

http://www.cookbook-r.com/Graphs/Titles_(ggplot2)/